Skip to content
This repository was archived by the owner on May 10, 2024. It is now read-only.

PARQUET-446: Hide Thrift compiled headers and Boost from public API, #include scrubbing#49

Closed
wesm wants to merge 10 commits intoapache:masterfrom
wesm:PARQUET-446
Closed

PARQUET-446: Hide Thrift compiled headers and Boost from public API, #include scrubbing#49
wesm wants to merge 10 commits intoapache:masterfrom
wesm:PARQUET-446

Conversation

@wesm
Copy link
Member

@wesm wesm commented Feb 12, 2016

This is the completion of work I started in PARQUET-442. This also resolves PARQUET-277 as no boost headers are included in the public API anymore.

I've done some scrubbing of #includes using Google's Clang-based include-what-you-use tool. PARQUET-522 can also be resolved when this is merged.

@wesm wesm changed the title PARQUET-446: Hide Thrift and Boost compiled headers from public API PARQUET-446: Hide Thrift and Boost compiled headers from public API, #include scrubbing Feb 13, 2016
@wesm
Copy link
Member Author

wesm commented Feb 13, 2016

OK, I finished a bunch of include-what-you-use cleanup, so we can also resolve PARQUET-522 when this is merged.

Note that the wall time for compiling libparquet.so is about 10% faster. I'm seeing 18.2s before and 16.3s after.


protected:
size_t batch_size_;
int32_t batch_size_;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep this int64_t? Why should we limit the batch_size to 2GB of a byte type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make int64_t -- data pages are limited to int32_t but a row group could in theory be larger.

@majetideepak
Copy link

I am little concerned with having our own types for DataPage etc. We now have a bunch of code to construct these data pages and other structures. This code will not be useful for the write path. We will have to again use the set methods of the parquet types to construct objects like parquet::DataPageHeader etc. Can we just use the parquet types to begin with? This way we will also be heading closer to the write path.

@wesm
Copy link
Member Author

wesm commented Feb 13, 2016

@majetideepak Well, we have a hard constraint, which is that exposing Thrift includes in the public API is not acceptable. What I've done is in line with parquet-mr's equivalent code (I didn't actually look until now https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/page/DataPageV1.java)

From a software layering point of view it is better (IMHO) to decouple serialization-deserialization matters as much as possible.

I'm happy of course to look at an alternate proposal that satisfies the constraint that developers using parquet-cpp as a 3rd-party library do not need to include Thrift libraries and headers in their build system.

@wesm
Copy link
Member Author

wesm commented Feb 13, 2016

@julienledem can you review and comment? Thank you

@wesm wesm changed the title PARQUET-446: Hide Thrift and Boost compiled headers from public API, #include scrubbing PARQUET-446: Hide Thrift compiled headers and Boost from public API, #include scrubbing Feb 13, 2016
@majetideepak
Copy link

I see your point with respect to Serializer and Deserializer and parquet-mr.
parquet-cpp should be able to use any serializer and deserializer not just thrift.

// assembled in a serialized stream for storing in a Parquet files

SerializedPageReader::SerializedPageReader(std::unique_ptr<InputStream> stream,
Compression::type codec) :

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we name this ThriftSerializedPageReader since it is specific to thrift? If we want to move on the lines of parquet-mr, I believe there should be further de-coupling from thrift and this Reader. We probably should have another module like parquet-cpp/parquet-thrift.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be done in a further downstream patch but I want to make sure we are going on the right path. @julienledem, @nongli and others, please pitch in your experience with parquet-mr here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could make parquet-cpp similar to parquet-mr/parquet-thrift (if I understood it correctly) with an additional ability to read and write "batch size" columns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with this line of reasoning. The Parquet specification uses Thrift for its metadata serialization, which is separate from code that can marshal materialized/reassembled records to and from Thrift or some other RPC/Serialization format (which is the role of parquet-thrift, parquet-avro, and other adapter components in parquet-mr). See for example: https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftToParquetFileWriter.java

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet-thrift in parquet-mr is not related to reading parquet metadata (which happens to use thrift) in parquet-format

We use thrift in 2 places:

  • read the parquet metadata in the file. We want that one encapsulated and invisible to the users of the library. This is why there's a separate object model additionally to the thrift idl file. The thrift IDL is here to make sure we stay backwards compatible and we have a strict spec for it. The other model is more convenient to work with and abstracts thrift out.
  • parquet-thrift is the thrift integration to let users write parquet files using a thrift IDL to define their domain specific schema. It depends on the core parquet components. Not the other way around. It is at the same level as parquet-protobuf, parquet-avro, parquet-pig ...

In particular hiding the thrift version used for the metadata will make sure that it does not conflict with the version used by the end user for their own code generation.
I hope this helps.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesm and @julienledem thanks! for the clarification.

@majetideepak
Copy link

+1 LGTM

@julienledem
Copy link
Member

+1 LGTM

@asfgit asfgit closed this in b71e826 Feb 15, 2016
@wesm
Copy link
Member Author

wesm commented Feb 16, 2016

thank you!

@wesm wesm deleted the PARQUET-446 branch February 16, 2016 00:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants